INTERSPEECH.2010 - Speech Synthesis

Total: 51

#1 A classifier-based target cost for unit selection speech synthesis trained on perceptual data

Authors: Volker Strom ; Simon King

Our goal is to automatically learn a perceptually-optimal target cost function for a unit selection speech synthesiser. The approach we take here is to train a classifier on human perceptual judgements of synthetic speech. The output of the classifier is used to make a simple three-way distinction rather than to estimate a continuously-valued cost. In order to collect the necessary perceptual data, we synthesised 145,137 short sentences with the usual target cost switched off, so that the search was driven by the join cost only. We then selected the 7200 sentences with the best joins and asked 60 listeners to judge them, providing their ratings for each syllable. From this, we derived a rating for each demiphone. Using as input the same context features employed in our conventional target cost function, we trained a classifier on these human perceptual ratings. We synthesised two sets of test sentences with both our standard target cost and the new target cost based on the classifier. A/B preference tests showed that the classifier-based target cost, which was learned completely automatically from modest amounts of perceptual data, is almost as good as our carefully- and expertly-tuned standard target cost.
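The three-way classifier-as-target-cost idea can be sketched in simplified form. This is not the authors' code: the nearest-centroid classifier, the toy feature vectors, and the cost values assigned to each class are all invented stand-ins for the paper's demiphone context features and listener ratings.

```python
# Illustrative sketch: a three-way classifier mapping unit context features
# to perceptual classes, whose output is used as a discrete target cost.

def nearest_centroid_train(samples):
    """samples: list of (feature_vector, label). Returns label -> centroid."""
    sums, counts = {}, {}
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def classify(centroids, x):
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(centroids, key=lambda y: dist2(centroids[y]))

# Toy perceptual data: class 0 = good, 1 = acceptable, 2 = bad.
train = [([0.1, 0.2], 0), ([0.0, 0.1], 0),
         ([0.5, 0.5], 1), ([0.6, 0.4], 1),
         ([0.9, 0.9], 2), ([1.0, 0.8], 2)]
model = nearest_centroid_train(train)

# The predicted class index maps to a discrete target cost (values invented).
COST = {0: 0.0, 1: 1.0, 2: 5.0}
print(COST[classify(model, [0.05, 0.15])])  # a unit near the "good" cluster
```

The design point the abstract makes is that a coarse three-way decision, learned from listener judgements, can replace a hand-tuned continuous cost.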

#2 Applying scalable phonetic context similarity in unit selection of concatenative text-to-speech

Authors: Wei Zhang ; Xiaodong Cui

This paper presents an approach using phonetic context similarity as a cost function in unit selection of concatenative Text-to-Speech. The approach measures the degree of similarity between the desired context and the candidate segment under different phonetic contexts. It considers the impact from relatively far contexts when plenty of candidates are available and can take advantage of the data from other symbolically different contexts when the candidates are sparse. Moreover, the cost function also provides an efficient way to prune the search space. Different parameters for modeling, normalization and integerization are discussed. MOS evaluation shows that it can improve the synthesis quality significantly.

#3 Speech database reduction method for corpus-based TTS system

Authors: Mitsuaki Isogai ; Hideyuki Mizuno

We propose a new speech database reduction method that can create efficient speech databases for concatenation-type corpus-based TTS systems. Our aim is to create small speech databases that can yield the highest quality speech output possible. The main points of the proposed method are as follows: (1) it uses a 2-stage algorithm to reduce speech database size; (2) consideration of the real speech elements needed allows us to select the most suitable subset of a full-size database, which yields scalable downsized speech databases. A listening test shows that the proposed method can reduce a database from 13 hours to 10 hours with no degradation in output quality. Furthermore, speech synthesized from database sizes of 8 and 6 hours keeps a relatively high MOS of more than 3.5, i.e., 95% of the MOS obtained with the full-size database.

#4 Automatic error detection for unit selection speech synthesis using log likelihood ratio based SVM classifier

Authors: Heng Lu ; Zhen-Hua Ling ; Si Wei ; Lirong Dai ; Ren-Hua Wang

This paper proposes a method to automatically detect errors in the synthetic speech of a unit selection speech synthesis system using the log likelihood ratio and a support vector machine (SVM). For SVM training, a set of synthetic utterances is first generated by a given speech synthesis system, and their synthesis errors are labeled by manually annotating the segments that sound unnatural. Then, two context-dependent acoustic models are trained on the natural and unnatural segments of the labeled synthetic speech respectively. The log likelihood ratio of acoustic features between these two models is adopted to train the SVM classifier for error detection. Experimental results show the proposed method is effective in detecting errors of the pitch contour within a word for a Mandarin speech synthesis system. The proposed SVM method using the log likelihood ratio between context-dependent acoustic models outperforms an SVM classifier trained on acoustic features directly.
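The log-likelihood-ratio scoring at the heart of this method can be sketched as follows. This is a simplification, not the paper's implementation: single 1-D Gaussians stand in for the paper's context-dependent acoustic models, and the model parameters are invented; in the paper the resulting LLR values feed an SVM rather than a sign test.

```python
import math

# Score a segment by the log likelihood ratio between a "natural" model
# and an "unnatural" (error) model fit on labeled synthetic speech.

def gaussian_loglik(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def llr(x, natural, unnatural):
    """Positive LLR -> segment looks natural; negative -> suspicious."""
    return gaussian_loglik(x, *natural) - gaussian_loglik(x, *unnatural)

natural_model = (0.0, 1.0)    # (mean, var) fit on natural-sounding segments
unnatural_model = (3.0, 1.0)  # fit on segments annotated as errors

print(llr(0.2, natural_model, unnatural_model) > 0)  # near the natural mean
print(llr(2.9, natural_model, unnatural_model) < 0)  # near the error mean
```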

#5 Using robust Viterbi algorithm and HMM-modeling in unit selection TTS to replace units of poor quality

Authors: Hanna Silén ; Elina Helander ; Jani Nurminen ; Konsta Koppinen ; Moncef Gabbouj

In hidden Markov model-based unit selection synthesis, the benefits of both unit selection and statistical parametric speech synthesis are combined. However, the conventional Viterbi algorithm is forced to make a selection even when no suitable units are available. This can cause the search to drift and decrease the overall quality. Consequently, we propose a robust Viterbi algorithm that can simultaneously detect bad units and select the best sequence. The unsuitable units are replaced using hidden Markov model-based synthesis. Evaluations indicate that the robust Viterbi algorithm combined with unit replacement increases quality compared to the traditional algorithm.
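One way to picture the robust search is a Viterbi pass in which every position also offers a synthetic "fallback" unit at a fixed cost, so that when all real candidates are poor the fallback wins and marks a slot for HMM-based replacement. This is an assumption-laden sketch of the idea, not the authors' algorithm; the unit names, costs, and flat join cost are invented.

```python
FALLBACK = "HMM"
FALLBACK_COST = 2.0  # illustrative penalty for taking the statistical model

def robust_viterbi(slots, target_cost, join_cost):
    """slots: per-position candidate lists (fallback already appended)."""
    best = {u: target_cost(0, u) for u in slots[0]}
    back = []
    for t in range(1, len(slots)):
        new_best, ptr = {}, {}
        for u in slots[t]:
            prev = min(best, key=lambda p: best[p] + join_cost(p, u))
            new_best[u] = best[prev] + join_cost(prev, u) + target_cost(t, u)
            ptr[u] = prev
        back.append(ptr)
        best = new_best
    u = min(best, key=best.get)        # backtrack from the cheapest endpoint
    path = [u]
    for ptr in reversed(back):
        u = ptr[u]
        path.append(u)
    return list(reversed(path))

unit_cost = {"a1": 0.1, "a2": 0.3, "b1": 10.0, "c1": 0.2}

def target_cost(t, u):
    return FALLBACK_COST if u == FALLBACK else unit_cost[u]

def join_cost(p, u):
    return 0.5  # flat toy join cost

slots = [["a1", "a2", FALLBACK], ["b1", FALLBACK], ["c1", FALLBACK]]
print(robust_viterbi(slots, target_cost, join_cost))
# position 1 falls back to the HMM unit because "b1" is too costly
```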

#6 Automatic detection of abnormal stress patterns in unit selection synthesis

Authors: Yeon-Jun Kim ; Mark C. Beutnagel

This paper introduces a method to detect lexical stress errors in unit selection synthesis automatically using machine learning algorithms. If unintended stress patterns can be detected following unit selection, based on features available in the unit database, it may be possible to modify the units during waveform synthesis to correct errors and produce an acceptable stress pattern. In this paper, three machine learning algorithms were trained with acoustic measurements from natural utterances and corresponding stress patterns: CART, SVM and MaxEnt. Our experimental results showed that MaxEnt performs the best (83.3% for 3-syllable words, 88.7% for 4-syllable words correctly classified) in the natural stress pattern classification. Though classification rates are good, a large number of false alarms are produced. However, there is some indication that signal modifications based on false positives do little harm to the speech output.

#7 Enhancements of Viterbi search for fast unit selection synthesis

Authors: Daniel Tihelka ; Jiří Kala ; Jindřich Matoušek

The paper describes the optimisation of the Viterbi search used in unit selection TTS, whose performance still suffers with the large speech corpora necessary to achieve a high level of naturalness. To improve search speed, a combination of sophisticated stopping schemes and pruning thresholds is incorporated into the baseline search. The optimised search is, moreover, extremely flexible to configure, requiring only three intuitively comprehensible coefficients to be set. This provides the means for tuning the search depending on device resources, while allowing a significant performance increase. To illustrate this, several configuration scenarios, with speed-ups ranging from 6 to 58 times, are presented. Their impact on speech quality is verified by a CCR listening test, taking into account only the phrases with the highest number of differences when compared to the baseline search.
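The kind of pruning that such a search combines with Viterbi decoding can be sketched as a per-position filter on active hypotheses. The coefficient names and values below are hypothetical stand-ins for the paper's three tuning coefficients, not its actual settings.

```python
# Beam plus histogram pruning of Viterbi hypotheses at one search position.

BEAM = 1.5      # drop hypotheses worse than (best cost + BEAM)
MAX_ACTIVE = 3  # keep at most this many hypotheses per position

def prune(hypotheses):
    """hypotheses: dict unit -> accumulated path cost; returns survivors."""
    best = min(hypotheses.values())
    kept = {u: c for u, c in hypotheses.items() if c <= best + BEAM}
    # Histogram pruning: cap the number of survivors by cost rank.
    return dict(sorted(kept.items(), key=lambda kv: kv[1])[:MAX_ACTIVE])

hyps = {"u1": 0.2, "u2": 0.9, "u3": 2.5, "u4": 1.0, "u5": 1.6}
print(sorted(prune(hyps)))  # u3 falls outside the beam; u5 is capped out
```

Tightening `BEAM` and `MAX_ACTIVE` trades search accuracy for speed, which is the tuning knob the abstract describes.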

#8 Accurate pitch marking for prosodic modification of speech segments

Authors: Thomas Ewender ; Beat Pfister

This paper describes a new approach to pitch marking. Unlike other approaches that use the same combination of features for the whole signal, we take the signal properties into account and apply different features according to a heuristic. We use the short-term energy as a novel robust feature for placing the pitch marks. Where the energy information turns out to be unsuitable as an indicator, we resort to the fundamental wave computed from a contiguous F0 contour in combination with detailed voicing information. Our experiments demonstrate that the proposed pitch marking algorithm considerably improves the quality of synthesised speech generated by a concatenative text-to-speech system that uses TD-PSOLA for prosodic modifications.

#9 A novel hybrid approach for Mandarin speech synthesis

Authors: Shifeng Pan ; Meng Zhang ; Jianhua Tao

The paper investigates a new method to solve concatenation problems in Mandarin speech synthesis, based on a hybrid approach combining HMM-based speech synthesis and unit selection. Unlike other work that uses only boundary F0 errors as the concatenation cost, a CART-based F0 dependency model that takes rich context information into account is trained to measure the smoothness of F0. Instead of phoneme-sized units, the basic units of our HUS system are syllables, which have been shown to give better prosodic stability in Mandarin. The experiments show that the proposed method achieves better performance than both a conventional hybrid system and a unit selection system.

#10 Modeling liaison in French by using decision trees

Authors: Josafá de Jesus Aguiar Pontes ; Sadaoki Furui

French is known to be a language with major pronunciation irregularities at word endings with consonants. In particular, the well-known phonetic phenomenon called liaison is one of the major issues for French phonetizers. Rule-based methods have been used to address these issues, yet the current models still produce too many pronunciation errors to be usable in second-language learning applications. In addition, the number of rules tends to be large and their interaction complex, making maintenance a problem. To alleviate these problems, we propose an approach that, starting from a database compiled from cases documented in the literature, allows us to build C4.5 decision trees and subsequently automate the generation of the required rules. A prototype based on our approach has been tested against six other state-of-the-art phonetizers. The comparison shows the prototype system is better than most of them, being equivalent to the second-ranked system.

#11 Improvement on plural unit selection and fusion

Authors: Jian Luan ; Jian Li

Plural unit selection and fusion is a successful method for concatenative synthesis, yet its unit fusion algorithm is simple and requires improvement. Previous research on unit fusion has mainly addressed boundary smoothing and is not well suited to the application mentioned above. Therefore, a high-quality unit fusion method is proposed in this paper. More accurate pitch frame alignment and primary unit selection are implemented. In addition, the fusion of pitch frames is performed on FFT spectra for less quality loss. Experimental results indicate that the proposed method clearly outperforms the baseline, with an overall preference ratio of 54:17.

#12 Improving speech synthesis of machine translation output

Authors: Alok Parlikar ; Alan W. Black ; Stephan Vogel

Speech synthesizers are optimized for fluent natural text. However, in a speech to speech translation system, they have to process machine translation output, which is often not fluent. Rendering machine translations as speech makes them even harder to understand than the synthesis of natural text. A speech synthesizer must deal with the disfluencies in translations in order to be comprehensible and communicate the content. In this paper, we explore three synthesis strategies that address different problems found in translation output. By carrying out listening tasks and measuring transcription accuracies, we find that these methods can make the synthesis of translations more intelligible.

#13 Paraphrase generation to improve text-to-speech synthesis

Authors: Ghislain Putois ; Jonathan Chevelu ; Cédric Boidin

Text-to-speech synthesiser systems are of good overall quality, especially when adapted to a specific task. Given this task and an adapted voice corpus, the message quality mainly depends on the wording used. This paper presents how a paraphrase generator can be used in synergy with a text-to-speech synthesis system to improve its overall performance. Our system is composed of a paraphrase generator using a French-to-French corpus learnt from a bilingual aligned corpus, a TTS selector based on the unit selection cost, and a TTS synthesiser. We present an evaluation of the system, which highlights the need for systematic subjective evaluation.

#14 Speaker and language adaptive training for HMM-based polyglot speech synthesis

Author: Heiga Zen

This paper proposes a technique for speaker and language adaptive training for HMM-based polyglot speech synthesis. Language-specific context-dependencies in the system are captured using CAT with cluster-dependent decision trees. Acoustic variations caused by speaker characteristics are handled by CMLLR-based transforms. This framework allows multi-speaker/multi-language adaptive training and synthesis to be performed. Experimental results show that the proposed technique achieves better synthesis performance than both speaker-adaptively trained language-dependent and language-independent models.

#15 Context adaptive training with factorized decision trees for HMM-based speech synthesis

Authors: Kai Yu ; Heiga Zen ; François Mairesse ; Steve Young

To achieve high quality synthesised speech in HMM-based speech synthesis, the effective modelling of complex contexts is critical. Traditional approaches use context-dependent HMMs with decision tree based clustering to model the full contexts. However, weak contexts are difficult to capture using this approach. Context adaptive training provides a structured framework for this, whereby standard HMMs represent normal contexts and linear transforms represent the additional effects of weak contexts. In contrast to speaker adaptive training, separate decision trees have to be built for the weak and normal context factors. This paper describes the general framework of context adaptive training and investigates three concrete forms: MLLR, CMLLR and CAT based systems. Experiments on a word-level emphasis synthesis task show that all context adaptive training approaches can outperform the standard full-context-dependent HMM approach. The MLLR based system achieved the best performance.

#16 Roles of the average voice in speaker-adaptive HMM-based speech synthesis

Authors: Junichi Yamagishi ; Oliver Watts ; Simon King ; Bela Usabaev

In speaker-adaptive HMM-based speech synthesis, there are typically a few speakers for which the output synthetic speech sounds worse than that of other speakers, despite having the same amount of adaptation data from within the same corpus. This paper investigates these fluctuations in quality and concludes that as mel-cepstral distance from the average voice becomes larger, the MOS naturalness scores generally become worse. Although this negative correlation is not that strong, it suggests a way to improve the training and adaptation strategies. We also draw comparisons between our findings and the work of other researchers regarding "vocal attractiveness."

#17 An HMM trajectory tiling (HTT) approach to high quality TTS

Authors: Yao Qian ; Zhi-Jie Yan ; Yijian Wu ; Frank K. Soong ; Xin Zhuang ; Shengyi Kong

Current state-of-the-art HMM-based speech synthesis can produce highly intelligible speech but still carries an intrinsic vocoding flavor due to its simple excitation model. In this paper, we propose a new HMM trajectory tiling approach to high quality TTS. The trajectory generated by the refined HMM is used to guide the search for the closest waveform segment “tiles” for rendering highly intelligible and natural sounding speech. Normalized distances between the HMM trajectory and those of waveform unit candidates are used to construct a unit sausage. Normalized cross-correlation is then used to find the best unit sequence in the sausage, which serves as the sequence of tiles that most closely tracks the HMM trajectory guide. Tested on two British English databases, our approach renders natural sounding speech without sacrificing the high intelligibility achieved by HMM-based TTS. These results are confirmed subjectively by AB preference and intelligibility tests.
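The normalized cross-correlation used to pick tiles from the sausage can be sketched as below. The vectors and candidate names are toy stand-ins; the paper applies the score to trajectories of waveform unit candidates against the HMM-generated guide.

```python
import math

# Normalized cross-correlation between an HMM-generated guide trajectory
# and each candidate unit's trajectory; the highest score wins.

def ncc(x, y):
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

guide = [0.0, 1.0, 2.0, 3.0]
candidates = {"tile_a": [0.1, 1.1, 2.0, 2.9],   # tracks the guide closely
              "tile_b": [3.0, 2.0, 1.0, 0.0]}   # anti-correlated with it
best = max(candidates, key=lambda k: ncc(guide, candidates[k]))
print(best)
```

Mean removal and normalisation make the score insensitive to offset and scale, so it measures shape similarity rather than absolute distance.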

#18 A perceptual study of acceleration parameters in HMM-based TTS

Authors: Yi-Ning Chen ; Zhi-Jie Yan ; Frank K. Soong

A previous study in HMM-based TTS has shown that the acceleration parameters are able to generate smoother trajectories with less distortion. However, the effect has never been investigated in formal objective and subjective tests. In this paper, the acceleration parameters in trajectory generation are studied in depth. We show that discarding the acceleration parameters introduces only a small additional distortion, yet human subjects can easily perceive the quality degradation, because saw-tooth-like trajectories are commonly generated. Therefore, we choose the upper- and lower-bounded envelopes of the saw-tooth trajectories for further analysis. Experimental results show that both envelope trajectories have larger objective distortions. However, the speech synthesized using the envelope trajectories becomes perceptually transparent to the reference. This perceptual study facilitates efficient implementation of low-cost TTS systems, as well as low bit rate speech coding and reconstruction.

#19 Evaluation of prosodic contextual factors for HMM-based speech synthesis

Authors: Shuji Yokomizo ; Takashi Nose ; Takao Kobayashi

We explore the effect of prosodic contextual factors in HMM-based speech synthesis. In a baseline system, many contextual factors are used during model training, and the cost of parameter tying by context clustering becomes relatively high compared to that in speech recognition. We examine the choice of prosodic contexts using objective measures for English and Japanese speech data. The experimental results show that more compact context sets give comparable or close performance to the conventional full context set.

#20 Sinusoidal model parameterization for HMM-based TTS system

Authors: Slava Shechtman ; Alex Sorin

A sinusoidal representation of speech is an alternative to the source-filter model. It is widely used in speech coding and unit-selection TTS, but is less common in statistical TTS frameworks. In this work we utilize Regularized Cepstral Coefficients (RCC) estimated on a mel-frequency scale for amplitude spectrum envelope modeling within an HMM-based TTS platform. Improved subjective quality for mel-frequency RCC (MRCC) combined with sinusoidal-model-based reconstruction is reported, compared to the state-of-the-art MGC-LSP parameters.

#21 Improved training of excitation for HMM-based parametric speech synthesis

Authors: Yoshinori Shiga ; Tomoki Toda ; Shinsuke Sakai ; Hisashi Kawai

This paper presents an improved method of training for the unvoiced filter that comprises an excitation model, within the framework of parametric speech synthesis based on hidden Markov models. The conventional approach calculates the unvoiced filter response from the differential signal of the residual and voiced excitation estimate. The differential signal, however, includes the error generated by the voiced excitation estimates. Contaminated by the error, the unvoiced filter tends to be overestimated, which causes the synthetic speech to be noisy. In order for unvoiced filter training to obtain targets that are free from the contamination, the improved approach first separates the non-periodic component of residual signal from the periodic component. The unvoiced filter is then trained from the non-periodic component signals. Experimental results show that unvoiced filter responses trained with the new approach are clearly noiseless, in contrast to the responses trained with the conventional approach.

#22 Excitation modeling based on waveform interpolation for HMM-based speech synthesis

Authors: June Sig Sung ; Doo Hwa Hong ; Kyung Hwan Oh ; Nam Soo Kim

It is generally known that a well-designed excitation produces high quality signals in hidden Markov model (HMM)-based speech synthesis systems. This paper proposes a novel technique for generating excitation based on waveform interpolation (WI). For modeling the WI parameters, we employ statistical methods such as principal component analysis (PCA). The parameters of the proposed excitation modeling technique can easily be combined with a conventional speech synthesis system under the HMM framework. In a number of experiments, the proposed method has been found to generate more natural-sounding speech.

#23 Formant-based frequency warping for improving speaker adaptation in HMM TTS

Authors: Xin Zhuang ; Yao Qian ; Frank K. Soong ; Yijian Wu ; Bo Zhang

In this paper we investigate frequency warping based explicitly on the mapping between the first four formant frequencies of five long vowels recorded by source and target speakers. A universal warping function is constructed to improve MLLR-based speaker adaptation performance in TTS. The function is used to warp the frequency scale of a source speaker’s data toward that of the target speaker’s data, and an HMM is trained on the frequency-warped features of the source speaker. Finally, MLLR-based speaker adaptation is applied to the trained HMM for synthesizing the target speaker’s speech. When tested on a database of 4,000 sentences (source speaker) and 100 sentences each from a male and a female speaker (target speakers), the formant-based frequency warping proved very effective in reducing log spectral distortion relative to the system without formant frequency warping, an improvement also confirmed subjectively in AB preference and ABX speaker similarity listening tests.
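A warping function built from matched formant pairs could look like the piecewise-linear sketch below. This is a hypothetical illustration: the formant values are invented, and the paper does not specify that its universal warping function takes exactly this form.

```python
# Piecewise-linear frequency warp anchored at matched source/target formants.

def make_warp(src_formants, tgt_formants, nyquist=8000.0):
    """Return a function f(src_hz) -> tgt_hz, linear between anchor pairs."""
    anchors = ([(0.0, 0.0)] + sorted(zip(src_formants, tgt_formants))
               + [(nyquist, nyquist)])

    def warp(f):
        for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
            if x0 <= f <= x1:
                return y0 + (f - x0) * (y1 - y0) / (x1 - x0)
        return f  # outside the range: identity

    return warp

# Invented average formants (Hz) for a source and a target speaker.
warp = make_warp([500.0, 1500.0, 2500.0, 3500.0],
                 [550.0, 1650.0, 2600.0, 3600.0])
print(round(warp(1500.0)))  # an anchor maps exactly to its target formant
print(round(warp(1000.0)))  # frequencies between anchors are interpolated
```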

#24 Improved modelling of speech dynamics using non-linear formant trajectories for HMM-based speech synthesis

Authors: Hongwei Hu ; Martin J. Russell

This paper describes the use of non-linear formant trajectories to model speech dynamics. The performance of the non-linear formant dynamics model is evaluated using HMM-based speech synthesis experiments, in which the 12 dimensional parallel formant synthesiser control parameters and their time derivatives are used as the feature vectors in the HMM. Two types of formant synthesiser control parameters, named piecewise constant and smooth trajectory parameters, are used to drive the classic parallel formant synthesiser. The quality of the synthetic speech is assessed using three kinds of subjective tests. This paper shows that the non-linear formant dynamics model can improve the performance of HMM-based speech synthesis.

#25 Global variance modeling on the log power spectrum of LSPs for HMM-based speech synthesis

Authors: Zhen-Hua Ling ; Yu Hu ; Lirong Dai

This paper presents a method to model the global variance (GV) of the log power spectra derived from the line spectral pairs (LSPs) in a sentence for HMM-based parametric speech synthesis. Unlike the conventional GV method, where the observations for GV model training are the variances of the spectral parameters for each training sentence, our proposed method directly models the temporal variances of each frequency point in the spectral envelope reconstructed using LSPs. At the synthesis stage, the likelihood function of the trained GV model is integrated into the maximum likelihood parameter generation algorithm to alleviate the over-smoothing effect on the generated spectral structures. Experimental results show that the proposed method outperforms the conventional GV method when LSPs are used as the spectral parameters, and significantly improves the naturalness of synthetic speech.
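The conventional GV idea that this paper extends can be sketched in toy form: penalise generated trajectories whose per-sentence variance falls below the variance observed in natural speech. This is a simplified illustration, not the paper's per-frequency-point formulation; the trajectories, the GV target, and the quadratic penalty are all stand-ins for the Gaussian GV likelihood used in practice.

```python
# Toy global-variance penalty on a generated parameter trajectory.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def gv_penalty(traj, target_var):
    """Quadratic mismatch between the trajectory's variance and the GV
    target estimated from natural training sentences."""
    return (variance(traj) - target_var) ** 2

natural_var = 1.0                        # GV target (invented value)
oversmoothed = [0.0, 0.1, 0.0, -0.1]     # typical ML-generated trajectory
livelier = [1.0, -1.0, 1.0, -1.0]        # variance matches natural speech

# The over-smoothed trajectory incurs the larger GV penalty.
print(gv_penalty(oversmoothed, natural_var) > gv_penalty(livelier, natural_var))
```

In parameter generation, a term of this kind is added to the HMM output likelihood so that maximising the combined objective counteracts over-smoothing.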